Chapter 15 · Full-Text Search

Full-Text Search & Elasticsearch

From a simple ILIKE query that suffocates under millions of rows, to relevance-ranked millisecond search — this chapter explains why tools like Elasticsearch exist, how the inverted index works, what BM25 does, and when to reach for Postgres FTS versus Elasticsearch in a real backend.

01

The Origin Story — Why Search Got Hard

Imagine it's 2005. You're a backend engineer at a fast-growing e-commerce startup. The product catalog has ~5,000 items. You write a search endpoint using a classic SQL pattern:

SQL-- Works great at 5 000 rows
SELECT * FROM products
WHERE  name        ILIKE '%laptop%'
   OR  description ILIKE '%laptop%';

This returns results in roughly 50 ms. Users are happy, managers are happy, life is simple.

Fast-forward two years. The catalog explodes to millions of products. The same query now takes 30 seconds. The company is losing sales. On top of the speed issue, new requirements arrive:

These three requirements — speed, relevance, typo-tolerance — are exactly the founding charter of dedicated full-text search engines.
02

The Librarian Analogy

Think of your relational database (Postgres, MySQL, etc.) as a librarian in a giant library. The librarian knows exactly where every book lives on the shelf.

But it has one fatal flaw: to find books about a topic, it must physically walk to every single shelf, pull out every book, and read it page by page looking for your keyword.

User Query "machine learning" Postgres ILIKE Scan scans EVERY row Harry Potter Game of Thrones Intro to ML ✓ Cooking Deep Learning ✓ Fiction Scans every book one by one — O(n) Two Fatal Flaws: ① Slow — O(n) scan ② No relevance order
Fig 1 — Postgres ILIKE performs a full sequential scan: every row is examined individually.

The two problems summarised:

  1. Speed: A table with 50 million rows means 50 million row comparisons. On spinning disk this was catastrophic even in the SSD era it wastes enormous I/O.
  2. No relevance: A book whose title is "Introduction to Machine Learning" ranks identically to a book that merely mentions "machine" once in the appendix. The database has zero concept of importance.
03

Why ILIKE '%term%' Cannot Use Indexes

Postgres B-tree indexes work by sorting values. A B-tree can locate name LIKE 'lapt%' (prefix match) efficiently because sorted strings share prefixes. But ILIKE '%laptop%' has a leading wildcard — there is no useful prefix to sort on. The engine must fall back to a sequential scan.

A leading % in a LIKE/ILIKE expression disables B-tree index usage. Postgres will read every single page of the heap.

What the query plan looks like

SQL — EXPLAIN ANALYZEEXPLAIN ANALYZE
SELECT id, name
FROM  products
WHERE name ILIKE '%laptop%';

-- Output (simplified)
-- Seq Scan on products  (cost=0.00..18450.00 rows=5 width=36)
--   Filter: ((name)::text ~~* '%laptop%'::text)
--   Rows Removed by Filter: 4 999 995
-- Planning Time: 0.2 ms
-- Execution Time: 28 940.7 ms   ← 29 seconds!

Notice Seq Scan — every row examined, 4,999,995 rows thrown away, execution time ~29 s.

04

The Inverted Index — The Core Invention

"Instead of going through every document to find the term, maintain a map from every term to the documents that contain it."

This insight — inverting the search — is what the name captures. A normal index maps document → words. An inverted index maps word → documents.

How It Is Built

When a document is first stored (or updated), the search engine runs it through an analysis pipeline:

  1. Tokenisation — break text into individual tokens (words). "Introduction to Machine Learning" → ["introduction", "to", "machine", "learning"]
  2. Normalisation / Lowercasing — convert to lowercase so "Machine" and "machine" are the same token.
  3. Stop-word removal — drop common words with no discriminative value ("to", "the", "a"). (Optional, configurable.)
  4. Stemming / Lemmatisation — reduce words to their root: "running" → "run", "searches" → "search". So a query for "searching" also matches "searched".
  5. Index entry creation — for each resulting term, record the document ID and position.
Documents Doc 1: "Intro to Machine Learning" Doc 2: "The Machine Age" Doc 3: "Coffee Machine Manual" Doc 4: "Deep Learning Fundamentals" Analyse & Index Inverted Index TERM POSTINGS (doc, positions) machine Doc1(p1,15,23) Doc2(p5,89) Doc3(p1) learning Doc1(p1,16,24) Doc4(p2,8,45) intro Doc1(p1) coffee Doc3(p1,4,12) deep Doc4(p1,3) Query: "machine learning" → Intersect postings → ranked results in O(log n)
Fig 2 — Inverted index: term → posting list (document IDs + positions). Lookup is O(1) hash or O(log n) B-tree.
The key insight: inverted index lookup time depends on the number of matching documents, not the total number of documents. Millions of documents = no problem, as long as the hit list is small.
05

Elasticsearch — What It Is & How It Works

Elasticsearch is a distributed, JSON-document search engine built on top of Apache Lucene — the battle-tested Java library that implements the inverted index and BM25 scoring. Lucene has been around since 1999; Elasticsearch (2010) wraps it in a REST API and adds distributed clustering, replication, and a rich query DSL.

Key characteristics:

Creating an Index with Field Mapping

HTTP / Elasticsearch DSLPUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",          // analysed — tokenised, stemmed
        "boost": 3            // title matches are 3× more relevant
      },
      "description": {
        "type": "text",
        "boost": 1.5
      },
      "category": {
        "type": "keyword"     // not analysed — exact match only
      },
      "price": { "type": "float" }
    }
  }
}

The difference between text and keyword is critical:

Mapping TypeAnalysed?Use For
textYes — tokenised, stemmedProduct names, descriptions, reviews
keywordNo — stored as-isStatus codes, tags, category IDs

A Full-Text Search Query

HTTP / Elasticsearch DSLGET /products/_search
{
  "query": {
    "multi_match": {
      "query": "laptop",
      "fields": ["name^3", "description^1.5", "category"],
      "fuzziness": "AUTO"    // typo tolerance
    }
  },
  "size": 10
}

^3 is field boosting — matches in name contribute 3× as much to the relevance score as matches in the default field weight.

06

BM25 — How Relevance Scoring Works

BM25 (Best Match 25) is the default ranking function in Elasticsearch (and also in Postgres FTS). It produces a floating-point score per document; higher = more relevant. The full formula has several components — but conceptually four factors drive the score:

BM25 Relevance Score Term Frequency (TF) How often term appears in THIS document Inverse Doc Freq (IDF) How rare the term is across ALL documents Document Length Normalise for long docs (shorter = higher boost) Field Boosting title > description > content (configurable)
Fig 3 — The four pillars of BM25 relevance scoring in Elasticsearch.
FactorWhat It MeasuresEffect on Score
Term FrequencyCount of query term in documentHigher = more relevant (with diminishing returns)
Inverse Doc FrequencyRarity of term across indexRare terms carry more signal than common ones
Document LengthLength vs. avg document lengthShort docs with the term rank higher than long docs
Field BoostWhich field the term appears inTitle match > body match (configurable multiplier)
You don't need to understand the BM25 math to use Elasticsearch productively. Just know: field-boosted + term-frequency + IDF = the score. Tune boosts in mapping and query DSL.

Why IDF Matters

Consider the word "the" — it appears in virtually every English document, so it has very high document frequency and a very low IDF weight. It contributes almost nothing to scores. A rare word like "photovoltaic" or "elasticsearch" appearing in a document is a strong signal of topical relevance.

07

Typo Tolerance — Fuzzy Search

Elasticsearch uses Levenshtein edit distance under the hood for fuzzy matching. Edit distance is the minimum number of single-character operations (insert, delete, substitute) needed to transform one word into another.

Elasticsearch DSL — fuzzy query{
  "query": {
    "fuzzy": {
      "name": {
        "value": "lapto",
        "fuzziness": "AUTO",  // 0 for 1-2 chars, 1 for 3-5, 2 for 6+
        "prefix_length": 1       // first N chars must match exactly
      }
    }
  }
}

"fuzziness": "AUTO" is the recommended setting. It automatically adjusts the allowed edit distance based on word length. prefix_length: 1 prevents matching completely unrelated words that happen to be 1–2 edits away.

Type-ahead / autocomplete interfaces like those on Amazon or Google use this capability. As you type "trendin", Elasticsearch finds "trending" with fuzziness=1 and returns suggestions in <50 ms.
08

The ELK Stack — Logs & Observability

Elasticsearch isn't only for product search. It's the "E" in the famous ELK Stack used for centralised log management and observability.

Logstash Collect & Parse Logs Elasticsearch Index & Search Logs Kibana Visualise & Dashboard Beats (Filebeat / Metricbeat) → lightweight agents that ship logs/metrics to Logstash or directly to ES
Fig 4 — The ELK Stack. Logstash ingests & parses, Elasticsearch indexes, Kibana visualises.

If your company already runs ELK for logging, adding product/content search on top of the same Elasticsearch cluster is a natural, cost-effective choice — you're already paying for the infrastructure and your ops team already knows it.

09

Postgres Full-Text Search in Go

Postgres has had native full-text search since version 8.3 via the tsvector / tsquery types. It uses its own inverted index (the GIN index) and a BM25-variant scorer called ts_rank. It's not as feature-rich as Elasticsearch but perfectly adequate for moderate search needs on data already in Postgres.

DDL — Add a GIN Index

SQL-- Add a generated tsvector column and index it
ALTER TABLE products
  ADD COLUMN search_vec tsvector
    GENERATED ALWAYS AS (
      setweight(to_tsvector('english', coalesce(name, '')), 'A') ||
      setweight(to_tsvector('english', coalesce(description, '')), 'B')
    ) STORED;

CREATE INDEX idx_products_fts ON products USING GIN(search_vec);

setweight('A') means title matches outrank description matches — the Postgres equivalent of field boosting.

Go — Full-Text Search Handler

Gopackage main

import (
    "context"
    "fmt"
    "log"
    "net/http"

    "github.com/jackc/pgx/v5/pgxpool"
)

type Product struct {
    ID          int
    Name        string
    Description string
    Rank        float64
}

func searchHandler(pool *pgxpool.Pool) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        query := r.URL.Query().Get("q")
        if query == "" {
            http.Error(w, "missing query param q", http.StatusBadRequest)
            return
        }

        // plainto_tsquery converts plain text safely (no special chars needed)
        // websearch_to_tsquery also supports AND/OR/-term syntax
        sql := `
            SELECT id, name, description,
                   ts_rank(search_vec, plainto_tsquery('english', $1)) AS rank
            FROM   products
            WHERE  search_vec @@ plainto_tsquery('english', $1)
            ORDER  BY rank DESC
            LIMIT  20
        `

        rows, err := pool.Query(context.Background(), sql, query)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        defer rows.Close()

        var results []Product
        for rows.Next() {
            var p Product
            rows.Scan(&p.ID, &p.Name, &p.Description, &p.Rank)
            results = append(results, p)
        }

        w.Header().Set("Content-Type", "application/json")
        fmt.Fprintf(w, "found %d results\n", len(results))
    }
}

func main() {
    pool, _ := pgxpool.New(context.Background(), "postgres://...")
    http.HandleFunc("/search", searchHandler(pool))
    log.Fatal(http.ListenAndServe(":8080", nil))
}
The @@ operator is Postgres's full-text match operator. It uses the GIN index — query time stays O(log n) regardless of table size.
10

Elasticsearch in Python (Official SDK)

The official Python client is elasticsearch-py. Install:

Shellpip install elasticsearch

Index Setup & Bulk Insert

Pythonfrom elasticsearch import Elasticsearch, helpers
import csv

es = Elasticsearch(
    cloud_id="YOUR_CLOUD_ID",
    api_key="YOUR_API_KEY"
)

INDEX = "reviews"

# 1. Create index with explicit mapping
if es.indices.exists(index=INDEX):
    es.indices.delete(index=INDEX)

es.indices.create(index=INDEX, body={
    "mappings": {
        "properties": {
            "review":    {"type": "text"},     # analysed full-text
            "sentiment": {"type": "keyword"}  # exact: "positive" / "negative"
        }
    }
})

# 2. Bulk insert from CSV
def generate_docs(path: str):
    with open(path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row.get("review") and row.get("sentiment"):
                yield {
                    "_index": INDEX,
                    "_source": {
                        "review":    row["review"],
                        "sentiment": row["sentiment"]
                    }
                }

success, errors = helpers.bulk(es, generate_docs("reviews.csv"))
print(f"Inserted {success} documents, {len(errors)} errors")

Search with Fuzzy + Field Boost

Pythondef search_reviews(query: str, sentiment_filter: str = None) -> list:
    must_clauses = [{
        "multi_match": {
            "query":     query.lower(),
            "fields":   ["review"],
            "fuzziness": "AUTO",
            "operator":  "and"   # all terms must appear
        }
    }]

    filter_clauses = []
    if sentiment_filter:
        filter_clauses.append({
            "term": {"sentiment": sentiment_filter}
        })

    body = {
        "query": {"bool": {"must": must_clauses, "filter": filter_clauses}},
        "size": 20
    }

    resp = es.search(index=INDEX, body=body)
    return [
        {"score": hit["_score"], **hit["_source"]}
        for hit in resp["hits"]["hits"]
    ]

# Usage
results = search_reviews("gret product", sentiment_filter="positive")
# "gret" (typo for "great") matched via fuzziness=AUTO (edit distance 1)

Streaming Results from Two Sources (Node.js Pattern)

The demo in the lecture used a Next.js API route that streams results from Postgres and Elasticsearch simultaneously so neither waits for the other. Here is the pattern in pure Node.js:

Node.js (ESM)import { Client } from '@elastic/elasticsearch'
import { neon }     from '@neondatabase/serverless'

export async function GET(req) {
  const { searchParams } = new URL(req.url)
  const term = searchParams.get('q')

  const encoder = new TextEncoder()
  const stream = new ReadableStream({
    async start(controller) {
      // Fire BOTH queries concurrently — don't await sequentially
      const [pgResult, esResult] = await Promise.allSettled([
        pgSearch(term),
        esSearch(term),
      ])

      controller.enqueue(encoder.encode(JSON.stringify({
        source: 'postgres',
        ...pgResult
      })))
      controller.enqueue(encoder.encode(JSON.stringify({
        source: 'elasticsearch',
        ...esResult
      })))
      controller.close()
    }
  })
  return new Response(stream, { headers: { 'Content-Type': 'text/event-stream' } })
}
11

The Benchmark — ILIKE vs Elasticsearch on 50k Rows

The demo in the lecture populated 50,000 review documents into both Neon (cloud Postgres) and Elastic Cloud, both hosted in us-west to keep network latency equal. Results:

Search Latency on 50 000 rows 3 s 7 s 0 500ms ES: "laptop" 500ms ES: "only" ~3s PG: "laptop" ~7.5s PG: "only"
Fig 5 — Postgres ILIKE scales poorly with wider matches; Elasticsearch stays flat at ~500ms.
QueryElasticsearchPostgres ILIKESpeedup
"laptop"~500 ms~3 s
"only" (broad)~500 ms~7.5 s15×

The broader the search term (more matching rows), the worse ILIKE gets — because Postgres must retrieve and return all those rows from disk. Elasticsearch computes relevance scores in-memory on the shard and only returns the top-N, so latency stays low.

12

When to Use What

Postgres Full-Text Search

  • Data already in Postgres
  • Search is a secondary feature, not the core product
  • Team small / no dedicated infra for ES
  • Moderate data volume (< a few million rows)
  • Don't need fuzzy/typo search on day one

Elasticsearch

  • Search is a first-class feature
  • Millions+ of documents
  • Need typo tolerance / autocomplete
  • Complex relevance tuning needed
  • Already running ELK for logs anyway
  • Need aggregations / analytics on search results
As the lecturer noted: "You don't have to master Elasticsearch the way you must master databases. Most search use cases are covered by copy-pasting from the official docs or an LLM." Know when to reach for the tool; the tool's docs will handle the rest.

Decision Flow

Need text search? Millions of docs or search-first product? Yes Elasticsearch / OpenSearch No Postgres tsvector+GIN Both route through Apache Lucene (or compatible) under the hood. Master your database first; search is the second skill to add.
Fig 6 — Quick decision tree for choosing your search backend.
13

References & Further Reading


Backend Field Manual · Full-Text Search Chapter · Notes compiled from video lecture